Abstract

In this paper we initiate the study of optimization of bandit type problems in scenarios where 
the feedback of a play is not immediately known. This arises naturally in allocation problems 
which have been studied extensively in the literature, albeit in the absence of delays in the 
feedback. We study this problem in the Bayesian setting. In presence of delays, no solution 
with provable guarantees is known to exist with sub-exponential running time. 

We show that bandit problems with delayed feedback that arise in allocation settings can
be forced to have significant structure that gives us the ability to reason about the resulting policies. We
show an $O(1)$ approximation for a significantly general class of priors. The structural insights
we develop are of key interest and carry over to the setting where the feedback of an action is 
available instantaneously. In particular, we show a simple 2-approximation for the finite horizon 
Bayesian bandit problem, improving and generalizing prior work. 
1 Introduction



In this paper, we consider the problem of iterated allocation of resources, when the effectiveness of 
a resource is uncertain, and is learnt after some delay. Allocation of resources under uncertainty is a 
central problem in a variety of disciplines, notably in learning [7] and stochastic control [6]. In these
problems, we are asked to make a series of allocation decisions, based on past outcomes. Since the 
seminal contributions of Wald [31] and Robbins [26], a vast literature, including both optimal and 
near-optimal solutions, has been developed; see references in [31 [301 El E]. Of particular interest is
the celebrated Multi-Armed Bandit (MAB) problem, where an agent decides on allocating resources
between competing actions (arms) with uncertain rewards and can only take one action at a time 
(play the arm). The agent collects the reward and the state of the played arm is updated. 

However, an overwhelming majority of the literature focuses on scenarios that assume instan- 
taneous (or negligible in comparison to the horizon) revelation of the outcome of each allocation. 
In early work in the mid-1960s, Anderson [2] introduced the notion of delayed observations. Since
then, though there have been additional results [29, 9], a theoretical guarantee on adaptive decision
making under delayed observations has been elusive, and the computational difficulty in obtaining
such a guarantee has been commented upon in [121 S [281 03]. Recently, this issue of delays has been thrust to
the fore due to the increasing application of iterative allocation problems in online advertising. 

1.1 Problem Statement and Motivating Examples 

We now concretely define the Bayesian multi-armed bandit problem with delayed feedback. There
is a bandit with $n$ independent arms. When arm $i$ is played, the reward is drawn i.i.d. from a
distribution $D_i$, which is unknown. However, a prior $\mathcal{D}_i$ is specified over the possible $D_i$. When arm $i$
is played, the feedback about its reward outcome is learned only after $\delta_i$ steps. The set of observed
outcomes so far resolves the prior to a posterior distribution according to Bayes' rule. We also
assume each arm has a budget $B_i$ on the maximum reward it can accrue; further plays do not accrue
reward. In this setting, the goal is to design a decision policy for allocating the plays to the arms.
A decision policy is a mapping from the current state (determined by the posterior distributions of
each arm, the plays for which feedback is outstanding, and the remaining budgets of each arm) to an
action of which arm to play. There is a horizon of $T$ steps, and our goal is to design a polynomial
time algorithm that outputs a decision policy maximizing the expected revenue.
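
To make the model concrete, the following minimal Python sketch simulates one instance. It rests on illustrative assumptions that are not part of the paper: rewards are Bernoulli, each prior is a Beta distribution, and the reward is credited at play time even though it is observed only $\delta_i$ steps later. The names `DelayedBernoulliArm`, `run` and `greedy_posterior_mean` are hypothetical, and the greedy rule is only a placeholder, not the policy developed in this paper.

```python
import random
from collections import deque


class DelayedBernoulliArm:
    """One arm with Bernoulli rewards, a Beta prior, delay delta_i and budget B_i.

    Bernoulli/Beta is an illustrative assumption; the paper allows general priors.
    """

    def __init__(self, alpha, beta, delay, budget, rng):
        self.alpha, self.beta = alpha, beta        # current posterior Beta(alpha, beta)
        self.delay = delay                         # feedback delay delta_i
        self.budget = budget                       # remaining budget B_i
        self.rng = rng
        self.mean = rng.betavariate(alpha, beta)   # hidden D_i, drawn from the prior
        self.pending = deque()                     # (reveal_time, outcome) not yet observed

    def play(self, now):
        outcome = 1 if self.rng.random() < self.mean else 0
        self.pending.append((now + self.delay, outcome))
        reward = min(outcome, self.budget)         # plays beyond the budget accrue nothing
        self.budget -= reward
        return reward

    def observe(self, now):
        # Apply Bayes' rule to every outcome whose delay has elapsed.
        while self.pending and self.pending[0][0] <= now:
            _, outcome = self.pending.popleft()
            self.alpha += outcome
            self.beta += 1 - outcome


def greedy_posterior_mean(arms, now):
    """Placeholder decision policy: play the arm with the largest posterior mean."""
    return max(range(len(arms)),
               key=lambda i: arms[i].alpha / (arms[i].alpha + arms[i].beta))


def run(arms, horizon, policy):
    """Execute `policy` for `horizon` steps and return the total accrued reward."""
    total = 0
    for now in range(horizon):
        for arm in arms:
            arm.observe(now)                       # the state sees only revealed feedback
        total += arms[policy(arms, now)].play(now)
    return total
```

The state seen by the policy consists of the posteriors, the pending plays and the remaining budgets, which mirrors the definition of a decision policy given above.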

As concrete instantiations of this framework and a running example of the space of problems 
we address, consider the following three examples: 

1. Online Determination of Web Content: This is described by Agarwal et al. in [1]. The goal
is to present different websites/snippets to a user and induce the user to visit these pages. For
each user (type) $j$ there is prior information about the propensity of the user to be interested in
content $i$. This propensity is described by a distribution $D_{ij}$ drawn from a prior distribution $\mathcal{D}_{ij}$.
The priors are constructed from historical estimation. As the pages are displayed, we resolve this
prior; however, the characterization of the goodness of a visit depends on multiple signals, which
are known only after a delay $\delta_{ij}$. The delay can also arise from batched updates and systems issues.
Note that there are a number of "budget constraints", where each page can be displayed $T_j$ times.
For one user, this maps to the problem described above. The authors of [1] present a number of
different heuristics based on greedy policies.

2. Unmanned Aerial Vehicles: Bandit problems are used widely in stochastic control, where 
often actions are not immediate in their feedback. For example the difficulty in routing multiple 
UAVs among an uncertain terrain is the remoteness of the controller [23], which induces delays. 
3. Budgeted Allocation Problem: In this problem, there are $n$ bidders (advertisers) competing
for a collection of $m$ impressions (advertisement slots). Assume that the impressions arrive one at
a time, and correspond to the "semi-online" framework where the number of impressions of type
$j$ is $T_j$ and known beforehand. Each bidder $i$ has a budget $B_i$ on the total amount she is willing
to pay, and bids $b_{ij}$ for an impression of type $j$. When advertiser $i$ is allocated an impression of type
$j$, she is charged her bid only if an acquisition occurs. This is referred to as the cost-per-action
(CPA) model. The search engine allocates the impressions based on its estimate of the click or
conversion probability (CVR) $p_{ij}$ of this favorable event, in order to maximize its expected revenue
(where if a favorable event occurs for the pair $(i, j)$ and $i$'s budget is not exhausted, the contribution to
revenue is $b_{ij}$). The classical budgeted allocation problem [23j [I5j El [H] assumes that (i) the $p_{ij}$ are
known; and (ii) feedback about the clicks or conversions is instantaneous. Stochastic relaxations of
assumption (i) are already considered in [25], whose authors also provide simulations of greedy algorithms
and consider budgets. But in the case of conversions (and in the case of more refined measurements of
clicks as well) there is a natural delay in the feedback, which previous literature largely ignores.

1.2 Our Results and Techniques 

We present a constant factor approximation to the Bayesian MAB problem with delayed feedback. 
Before presenting our results, it is instructive to study closely related work. The Bayesian MAB 
problem has been considered in the model where the feedback is instantaneous, see [27], [22], [18],
[17], and constant factor approximations are known in this setting. In the absence of delays, all
approximation algorithms (as well as the exact Gittins index algorithm in the discounted reward 
setting [16], [30]) use the key observation that the posterior distribution of an arm does not change
if not played. Using this observation, it was shown in [18] that there is an approximately optimal 
policy that sequentially plays the arms. In presence of delays, the state of an arm now needs to 
capture the plays with outstanding feedback, and this not only causes the state to change even with 
no plays, but also requires exponential space to specify the state even for a single arm. Finally, any 
policy has to now interleave plays of the different arms, and hence, existing techniques in [18], [17], [30]
do not apply in any straightforward fashion. In a sense, the challenge in the delayed setting is not 
just to balance explorations along with the exploitation, but also to decide a schedule for the 
possible actions both for a single arm (due to delays) as well as across the arms. 

At a high level, our algorithm starts with the standard LP relaxation to the problem [6], [18]. This
relaxation provides a collection of single-arm policies whose execution is confined to one bandit arm;
the LP enforces constraints on the total expected plays these policies can make. Such a relaxation
has been widely used in the absence of delays, since it leads to efficient decomposable decision
policies (called index policies) that treat every arm independently of other arms. Note that the
joint state space over all the arms is exponential in size. As alluded to above, in the presence of 
delays, it is not even clear that such decomposable policies are efficiently computable. The most 
important question that we ask for delayed feedback therefore is: Do there exist near optimum 
policies that are decomposable and efficiently computable? In this paper we develop new techniques 
to analyze single-arm policies and provide a poly-time algorithm that outputs a decomposable 
decision policy (see footnote 1) which is an $O(1)$ approximation of the possibly entangled optimum policy.

In the rest of the paper we focus on the MAB case; we show in Appendix A how the results
generalize to multiple MABs or users (budgeted allocation case). This generalization is a consequence
of the fact that the algorithm uses the decomposition property (which implies linearity).

We summarize our main results and technical contributions below: 

1 These policies are very close to index policies; the policy of each arm is independent of other arms and hence 
efficiently computable; however, the policies of different arms cannot be reduced to a single priority value. 
• In Section 3 we prove a structural result, the Truncation Theorem. Theorem 1 shows that for
arbitrary state spaces, even in the presence of budgets, delays and other constraints which
preserve paths, a partially executed single-arm policy has reward proportional to that of the original
policy. This uses a stopping time type argument and uses the connection between state spaces
corresponding to priors and their martingale properties. This idea leads to shortening the
horizon of single-arm policies, and is very general; it directly yields a 2-approximation for the
finite horizon MAB problem with instantaneous feedback. In the absence of budgets, the best
known result for this problem was a 12-approximation [18], [17]; their approach is infeasible in
the presence of budgets. Our result also shows a factor 2 approximation for the "irrevocable
bandit" problem (see footnote 2), improving the factor 8 result of Farias and Madan [14]. We also show that
factor 2 is the best possible bound against LP relaxations over single-arm policies. This result
has also been used in subsequent results for MABs with nonlinear objective functions [19].

• In Section 4, using the Truncation Theorem, we show an $O(1)$ approximation for the MAB
problem with delayed feedback when $\delta_i = o(T/\log T)$ (see footnote 3). Applying the Truncation Theorem is
not straightforward: a policy could make very few plays initially, waiting for feedback after
every play, and load all the plays at the very end when it is fairly certain the reward is
large. To circumvent this, we develop an interesting compaction and simulation argument:
we modify the policy to make more plays upfront, but withhold using the outcomes until the
original policy used them. The resulting structure also implies that our policies are as efficient
to compute as standard index policies (without delays) [16], [6]. Finally, the combination of
the single-arm policies into a final feasible policy requires a novel priority based scheme.

The chief technical highlight of our work is a new way of accounting for the reward of a single-arm 
policy using the martingale property of priors that Bayes' rule entails. Traditionally, the Markov 
Decision Process (MDP) formulation resulted in the reward being accounted non-uniformly: given 
a posterior distribution, a play yields reward which is the expected value of the posterior. We use 
the fact that the plays draw i.i.d. rewards from a fixed underlying distribution (whose current belief 
is encoded in the prior), and hence the expected reward is the same for each play conditioned on 
this unknown distribution. We use this accounting for both the truncation and compaction steps, 
leading to simplicity of policies, clarity of analysis, and improved approximation bounds. 

1.3 Other Related Work 

There is an extensive literature on the MAB problem in the prior-free setting (see |21|, HI E]), and
policies with additive regret guarantees are known. Regret is the difference between the expected
reward of the policy and the reward of an omniscient policy which knows all the distributions.
However, these results require both a large reward rate and a time horizon $T$ that is large compared
to the number of arms. In the application scenarios mentioned above, it will typically be the case
that the number of arms is very large and comparable to the optimization horizon and the reward
rates are low. This motivates the need for a purely multiplicative guarantee instead of additive
guarantees. Moreover, the analysis of these policies requires the plays of the arm with maximum
estimated reward to be continuous, which is not true in the presence of delays (or budgets). If the
delay of arm $i$ satisfies $\delta_i = o(n^{1/3} T^{2/3})$ (where $n$ is the number of arms), then the standard
explore-then-exploit strategy (where we play an arm long enough to start receiving the outcomes) gives
regret sub-linear in $T$. However, constants in the regret term depend on the scaling of the rewards.

2 The algorithm is not allowed to revisit an arm. 

3 We can show that using stochastic domination of Gittins-type indices, we can increase Si to O ^ log T ^ and still 
preserve the $O(1)$ approximation of decomposable policies. However, the proof is very technical and omitted.
2 Preliminaries



There is a bandit with $n$ independent arms. Arm $i$ has an underlying reward distribution $D_i$, which
is a random variable drawn from a prior distribution $\mathcal{D}_i$. These priors are specified as input. The
maximum possible reward that can be extracted from this arm is $B_i$, and though additional plays
can be made, they do not accrue additional reward. If an arm is played, the feedback about the
reward outcome is available only after $\delta_i$ time steps. As observations become available, the successive
posteriors (which serve as priors for the next trial) are produced by Bayes' rule. There is a
time horizon of $T$ plays. A decision policy specifies which arm to play given the current state of
each arm, which is captured by the remaining budget and time horizon, the posterior distribution, and
the plays with outstanding feedback for that arm. Each decision policy has a unique expected reward
value that is obtained on executing it (with different execution trajectories differing due to different
realizations of the underlying $\{D_i\}$). Our goal is to design a poly-time algorithm which outputs
a policy that approximately maximizes the expected value derived over the horizon of $T$ plays.

Input: Priors and Posterior Spaces. For each arm $i$, the input specifies the space $\mathcal{S}_i$ of possible
posterior distributions, which defines a natural DAG, the root of which corresponds to the initial
prior $\mathcal{D}_i$. Every other state $u \in \mathcal{S}_i$ corresponds to a set of observations, and hence to the posterior
$X_{iu}$ obtained by applying Bayes' rule to $\mathcal{D}_i$ with those observations. Playing the arm in state
$u$ yields a transition to state $v$ with probability $p_{uv}$, provided $v$ can be obtained from $u$ in one
additional observation; the probability $p_{uv}$ is simply the probability of this observation conditioned
on the posterior $X_{iu}$ at $u$. The expected posterior mean at a state $u \in \mathcal{S}_i$, denoted by $r_u = \mathbf{E}[X_{iu}]$,
satisfies the martingale property $r_u = \sum_v p_{uv} r_v$. An example is in Appendix B.
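
As a quick illustration of the martingale property, the sketch below checks $r_u = \sum_v p_{uv} r_v$ exactly for one familiar posterior space. It assumes Bernoulli rewards with a Beta prior, so a state is a pair $(a, b)$ of pseudo-counts; this choice and the helper names `posterior_mean`, `transitions` and `martingale_holds` are illustrative assumptions, not the paper's notation.

```python
from fractions import Fraction


def posterior_mean(a, b):
    """r_u for the Beta(a, b) posterior of a Bernoulli arm."""
    return Fraction(a, a + b)


def transitions(a, b):
    """Children of state u = (a, b) in the posterior DAG with probabilities p_uv."""
    p_success = posterior_mean(a, b)
    return [((a + 1, b), p_success), ((a, b + 1), 1 - p_success)]


def martingale_holds(a, b):
    """Check r_u == sum_v p_uv * r_v using exact rational arithmetic."""
    r_u = posterior_mean(a, b)
    return r_u == sum(p * posterior_mean(*v) for v, p in transitions(a, b))
```

Exact fractions are used so that the identity is tested as an equality rather than up to floating-point error.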

We need the running time to be $\mathrm{poly}(n, T, \sum_i |\mathcal{S}_i|)$, which is comparable to the running times
(via per-arm dynamic programming) for computing the standard index policies [6], [16].

Budgets: The budget Bi for arm i can be folded into the description of the posterior space: If the 
observations leading to the current posterior already violate the budget, the reward of this state is 
set to 0. This transformation helps us ignore the budget in subsequent sections. 

Single-arm Policies. Given an execution of the global policy $\mathcal{P}$, define its projection on arm $i$
to be the policy $\mathcal{P}_i$ defined by the actions induced on arm $i$; we term this a single-arm policy. Note
that the global policy may take an action in arm $i$ based on information regarding other arms (an
entangled state); that side information is lost in the projection.

Let $\{\mathcal{P}^s(i, T)\}$ describe all single-arm policies of arm $i$ with feedback delay of $\delta_i$ steps and
horizon $T$. Each $\mathcal{P}^s(i, T)$ is a (randomized) mapping from the current state to one of the following
actions: (i) make a play; (ii) wait some number of steps (less than or equal to $T$), so that when the
result of a previous play is known, the policy changes state; (iii) wait a few steps and make a
play (without extra information); or (iv) quit. The state of the system is captured by the current
posterior $u \in \mathcal{S}_i$, the plays with outstanding feedback, and the remaining time horizon. Note that
the state encodes plays with outstanding feedback, and this has size $2^{\delta_i}$, which is exponential in
the input. Ameliorating this dependence is an aspect that we address in our algorithm design.

Definition 1. Given a single-arm policy $\mathcal{P}_i$, let $R(\mathcal{P}_i)$ be the expected reward and $N(\mathcal{P}_i)$ the
expected number of plays, where the expectation is over the outcomes of the plays of $\mathcal{P}_i$.

3 The Truncation Theorem 

We now show that the time horizon of a single-arm policy on $T$ steps can be reduced to $\beta T$ for
constant $\beta < 1$ while sacrificing a constant factor in the reward. We note that though the statement
seems simple, this theorem only applies to single-arm policies and is not true for the global policy
executing over multiple arms. The proof of this theorem uses the martingale structure of the
rewards, and proceeds via a stopping time argument.

Theorem 1. (The Truncation Theorem) For any arbitrary single-arm policy $\mathcal{P}$ which traces out a
path over a horizon of $T$ steps in a state space $\mathcal{S}$, the identical policy $\mathcal{P}'$ that makes at least a
$\beta$ fraction of the plays on any decision path satisfies: (i) $R(\mathcal{P}') \ge \beta R(\mathcal{P})$ and (ii) $N(\mathcal{P}') \le N(\mathcal{P})$.

Proof. Let the average reward in the initial state be $\mu$, and let the prior be characterized by a distribution
$f(\mu)$. Consider the tree defined by the policy $\mathcal{P}$. Let $R(\mathcal{P}(\mu))$, $N(\mathcal{P}(\mu))$ denote the expected
reward and the number of plays of the policy $\mathcal{P}$ when the average reward is fixed at $\mu$. We have
$R(\mathcal{P}) = \int R(\mathcal{P}(\mu)) f(\mu)\, d\mu$. The critical part of the proof is the next claim, which basically reduces
to the fact that the decision to play does not affect the outcome.

Claim 1. If we execute the policy $\mathcal{P}(\mu)$ from any node $u \in \mathcal{S}$, the expected reward is $R(\mathcal{P}(\mu)) =
\sum_v \mu\, y_\mu(v)\, \mathrm{length}(v)$, where $\{y_\mu(v)\}$ is the distribution induced on the vertices of $\mathcal{S}$ where $\mathcal{P}(\mu)$
stops executing. Here $\mathrm{length}(v)$ refers to the length of the path (in number of edges) from $u$ to $v$.

Proof. We observe that if we play at a node $u'$, then we generate a reward $\mu$. Now the probability
of being at $u'$ is the sum of the probabilities of the paths that pass through $u'$.

$$R(\mathcal{P}(\mu)) = \sum_{u'} \mu \Pr[\text{reaching } u'] = \sum_{u'} \; \sum_{v:\, u' \text{ is on a path to } v,\ u' \neq v} \mu\, y_\mu(v) = \sum_v \mu\, y_\mu(v)\, \mathrm{length}(v)$$

The claim follows. □

(Continuing Proof of Theorem 1) Truncating $\mathcal{P}(\mu)$ to $\mathcal{P}'(\mu)$ induces a many-to-one map over the
paths at which the policy stops. Let $\{y_\mu(v)\}$, $\{y'_\mu(v')\}$ be the distributions induced on the vertices
of $\mathcal{S}$ where $\mathcal{P}(\mu)$, $\mathcal{P}'(\mu)$ stop executing, respectively.

If the path to $v$ got truncated to $g(v)$, then $y'_\mu(v') = \sum_{v:\, g(v)=v'} y_\mu(v)$. Further, from the
statement of the theorem, $\mathrm{length}(g(v)) = \beta T \ge \beta\, \mathrm{length}(v)$. Then using Claim 1 on $\mathcal{P}'(\mu)$,

$$\begin{aligned}
R(\mathcal{P}'(\mu)) &= \sum_{v'} \mu\, y'_\mu(v')\, \mathrm{length}(v') = \sum_{v'} \sum_{v:\, g(v)=v'} \mu\, y_\mu(v)\, \mathrm{length}(v') = \sum_{v'} \sum_{v:\, g(v)=v'} \mu\, y_\mu(v)\, \mathrm{length}(g(v)) \\
&\ge \sum_{v'} \sum_{v:\, g(v)=v'} \mu\, y_\mu(v)\, \beta\, \mathrm{length}(v) = \beta \sum_{v} \mu\, y_\mu(v)\, \mathrm{length}(v) = \beta R(\mathcal{P}(\mu))
\end{aligned}$$

The last part follows from applying Claim 1 on $R(\mathcal{P}(\mu))$. Note that the reward of each individual path is
not preserved up to a factor. But Claim 1 introduces an accounting method where the reward
is redistributed over the length. Note that it was important that the outcome of the play did not
depend on the decision to play. Integrating over $\mu$, the bound on $R(\mathcal{P}')$ follows. The truncation
cannot increase the expected number of plays; this proves the theorem. □
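
The theorem can be checked numerically on small instances. The sketch below is a minimal illustration under assumptions that are not from the paper: the arm is Bernoulli with a Beta$(a, b)$ prior, feedback is instantaneous, and the policy never waits, so every step on a decision path is a play and truncation amounts to running the same policy with horizon $\beta T$. The names `evaluate` and `stop_after_first_failure` are hypothetical.

```python
from fractions import Fraction


def evaluate(policy, a, b, horizon, t=0):
    """Return (R, N): expected reward and expected number of plays of a
    single-arm policy on a Bernoulli arm whose posterior is Beta(a, b).

    `policy(a, b, t)` returns True to play and False to quit.
    """
    if t >= horizon or not policy(a, b, t):
        return Fraction(0), Fraction(0)
    p = Fraction(a, a + b)                      # posterior mean = chance of reward 1
    r1, n1 = evaluate(policy, a + 1, b, horizon, t + 1)
    r0, n0 = evaluate(policy, a, b + 1, horizon, t + 1)
    reward = p * (1 + r1) + (1 - p) * r0
    plays = 1 + p * n1 + (1 - p) * n0
    return reward, plays


def stop_after_first_failure(b0):
    """Example policy: keep playing until the first reward of 0 is observed."""
    return lambda a, b, t: b == b0


T, beta = 10, Fraction(1, 2)
policy = stop_after_first_failure(b0=1)
R, N = evaluate(policy, 1, 1, T)                       # original policy, horizon T
R_trunc, N_trunc = evaluate(policy, 1, 1, int(beta * T))  # truncated to beta * T
```

For this instance the truncated policy keeps well over half of the reward, since most of the reward is collected early; the theorem only guarantees the factor $\beta$.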



3.1 Finite Horizon MAB with Instantaneous Feedback 

As a direct application, we consider the Finite Horizon Bayesian Problem in the delay free setting. 
To formulate an LP relaxation, we find the optimal single-arm policy for each arm, so that the 
resulting ensemble of policies have expected number of plays at most T. This yields: 

$$\mathrm{LP1} = \max_{\{\mathcal{P}^s(i,T)\}} \left\{ \sum_i R(\mathcal{P}^s(i,T)) \;\middle|\; \sum_i N(\mathcal{P}^s(i,T)) \le T \right\}$$

Denote the value of the solution of LP1 by FH; the solution is a collection of single-arm policies $\{\mathcal{P}_i\}$ such that $FH \le
\sum_i R(\mathcal{P}_i)$ and $\sum_i N(\mathcal{P}_i) \le T$. We can compute these policies efficiently; see [201 [T8l IBl I3U].

Rounding to a Feasible Policy: The rounding scheme is now simple. (Variants have appeared
before in several stochastic knapsack type optimization contexts [10], [18], [5].) Order the arms in
decreasing order of $R(\mathcal{P}_i)/N(\mathcal{P}_i)$. Play $\{\mathcal{P}_i\}$ in that order. If the decision in $\mathcal{P}_i$ is to quit, then we move to the next
arm. If at any point we have already made $T$ plays, we stop and quit the overall policy.
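
The rounding step can be sketched as follows. This is an illustrative sketch only: each single-arm policy is modelled as a pair consisting of its ratio $R(\mathcal{P}_i)/N(\mathcal{P}_i)$ and a hypothetical factory `start` that returns an iterator yielding one realized reward per play and ending when the policy quits. The function name `round_policies` is likewise an assumption.

```python
def round_policies(policies, horizon):
    """Play single-arm policies sequentially in decreasing order of R/N.

    policies: list of (ratio, start) pairs, where start() returns an iterator
              of realized rewards, one per play, exhausted when the policy quits.
    Returns (total reward, number of plays made).
    """
    order = sorted(policies, key=lambda p: p[0], reverse=True)
    plays, total = 0, 0
    for _, start in order:
        run = start()
        while plays < horizon:
            reward = next(run, None)
            if reward is None:       # this single-arm policy quit; move to the next arm
                break
            total += reward
            plays += 1
        if plays >= horizon:         # T plays have been made: stop the overall policy
            break
    return total, plays
```

Checking the horizon before drawing the next reward ensures that no play is simulated beyond the $T$ plays the overall policy is allowed.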
We first prove a simple algebraic lemma. 

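Lemma 1. Suppose $r_i/w_i$, $i = 1, 2, \ldots, n$ is a non-increasing sequence. Let $w = \sum_i w_i$. Then

$$\sum_{i<k} r_i \left(1 - \frac{1}{w}\sum_{j<i} w_j\right) \;\ge\; \frac{\left(\sum_{i<k} r_i\right)\left(\sum_{i<k} w_i\right)}{2w} + \frac{\sum_{i<k} r_i w_i}{2w} + \frac{\left(\sum_{i \ge k} w_i\right)\left(\sum_{i<k} r_i\right)}{w}$$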

Proof. The LHS is $\frac{1}{w}$ of the area under the curve defined by the chain of points $(0, r_1)$, $(w_1, r_1)$,
$(w_1, r_1 + r_2)$, $\ldots$, $(\sum_{j<i} w_j, \sum_{j \le i} r_j)$, $(\sum_{j \le i} w_j, \sum_{j \le i} r_j)$, $\ldots$, $(\sum_{j<k-1} w_j, \sum_{j<k} r_j)$, and a horizontal
line segment broken into two parts, $(\sum_{j<k} w_j, \sum_{j<k} r_j)$ and $(w, \sum_{j<k} r_j)$, for convenience.

The non-increasing ordering implies that all the points in the curve (except the last one) are
above the line $y \left(\sum_{j<k} w_j\right) = x \left(\sum_{j<k} r_j\right)$.

Therefore the area defined in the LHS dominates the sum of the areas defined by (i) the triangle
defined by $(0, 0)$, $(\sum_{j<k} w_j, \sum_{j<k} r_j)$, $(0, \sum_{j<k} r_j)$; (ii) the $k - 1$ small triangles defined by
$(\sum_{j<i} w_j, \sum_{j<i} r_j)$, $(\sum_{j<i} w_j, \sum_{j \le i} r_j)$ and $(\sum_{j \le i} w_j, \sum_{j \le i} r_j)$; and (iii) the rectangle of length
$\sum_{j \ge k} w_j$ and height $\sum_{j<k} r_j$.
These are the three terms on the RHS and thus the lemma follows. □
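
The inequality of Lemma 1 is easy to test on small integer inputs. The sketch below evaluates both sides exactly; the function name `lemma1_sides` and the convention that `r` and `w` are Python lists already sorted so that `r[i] / w[i]` is non-increasing are assumptions made for this illustration.

```python
from fractions import Fraction


def lemma1_sides(r, w, k):
    """Return (LHS, RHS) of Lemma 1 for a 1-indexed k in 1..n+1.

    r, w: lists of positive integers with r[i] / w[i] non-increasing.
    Sums over "i < k" in the lemma become the first k - 1 list entries.
    """
    total_w = sum(w)
    head_r, head_w = sum(r[:k - 1]), sum(w[:k - 1])
    lhs = sum(r[i] * (1 - Fraction(sum(w[:i]), total_w)) for i in range(k - 1))
    rhs = (Fraction(head_r * head_w, 2 * total_w)                              # big triangle
           + Fraction(sum(r[i] * w[i] for i in range(k - 1)), 2 * total_w)      # small triangles
           + Fraction((total_w - head_w) * head_r, total_w))                    # rectangle
    return lhs, rhs
```

For example, with `r = [3, 2]`, `w = [1, 2]` and `k = 3` the two sides are $13/3$ and $11/3$, while for `k = 2` both sides equal 3.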



Theorem 2. In time $\mathrm{poly}(T, \sum_i |\mathcal{S}_i|)$, we can compute a 2-approximation to the finite horizon
Bayesian MAB problem with instantaneous feedback (with budgets and arbitrary priors).

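Proof. We show that the policy described above has an expected reward of at least $\frac{1}{2} FH$. Let the number
of plays of arm $i$ be $T_i$. We know $\mathbf{E}[T_i] = N(\mathcal{P}_i)$ and $\sum_i N(\mathcal{P}_i) = T$. We start playing arm $i$ after
$\sum_{j<i} T_j$ plays (if the sum is less than $T$); we apply the Truncation Theorem (Theorem 1), and the expected
reward of $\mathcal{P}_i$ continuing from this point onward is at least $\left(1 - \frac{1}{T} \min\{T, \sum_{j<i} T_j\}\right) R(\mathcal{P}_i)$. Note that this
is a consequence of the independence of arm $i$ from $\sum_{j<i} T_j$. Thus the total expected reward is at least

$$\begin{aligned}
\sum_i \mathbf{E}\left[1 - \frac{1}{T} \min\Big\{T, \sum_{j<i} T_j\Big\}\right] R(\mathcal{P}_i) &\ge \sum_i \mathbf{E}\left[1 - \frac{1}{T} \sum_{j<i} T_j\right] R(\mathcal{P}_i) \\
&= \sum_i \left(1 - \frac{1}{T} \sum_{j<i} \mathbf{E}[T_j]\right) R(\mathcal{P}_i) = \sum_i R(\mathcal{P}_i) \left(1 - \frac{\sum_{j<i} N(\mathcal{P}_j)}{T}\right)
\end{aligned}$$

By Lemma 1 (with $r_i = R(\mathcal{P}_i)$, $w_i = N(\mathcal{P}_i)$ and $k = n + 1$) this is at least $\frac{1}{2} \sum_i R(\mathcal{P}_i) = \frac{1}{2} FH$. □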



Tight example of the analysis. We show that the gap of the optimum policy and LP1 is a
factor of $2 - O(\frac{1}{n})$, even for unit length plays. Consider the following situation: We have two
"types" of arms. The type I arm gives a reward with probability a = 1/n and 1 otherwise. The 
type II arm always gives a reward 0. We have n independent arms. Each has an identical prior 
distribution of being type I with probability $p = 1/n^2$ and type II otherwise. Set $T = n$.

Consider the symmetric LP solution that allocates one play to each arm; if it observes a 1, it
plays the arm for $n$ steps. The expected number of plays made is $n + O(1/n)$, and the expected
reward is $n \times 1/n = 1$. Therefore, $\mathrm{LP1} \ge 1 - O(1/n)$.
Consider the optimum policy. We first observe that if the policy ever sees a reward 1 then the
optimum policy has found one of the type I arms, and the policy will continue to play this arm
for the rest of the time horizon. At any point of time before the time horizon, since $T = n$, there is
always at least one arm which has not been played yet. Suppose we play an arm and observe the
reward 0; then the posterior probability of this arm being type II has increased. So the optimum policy
should not prefer this currently played arm over an unplayed arm. Thus the optimum policy would
be to order the arms arbitrarily and make a single play on every new arm. If the outcome is 0, the
policy quits, otherwise the policy keeps playing the arm for the rest of the horizon. The reward of
the optimum policy can thus be bounded by $\sum_{x=0}^{T-1} a p (1 + (T - x - 1) a) = p a^2 T(T+1)/2 + (1 - a)/n =
\frac{1}{2} + O(\frac{1}{n})$. Thus the gap is a factor of $2 - O(\frac{1}{n})$.

4 Multi-armed Bandits with Delayed Feedback 

Define $Q_i = T/\delta_i$. We assume that $\delta_i = o(T/\log T)$, which implies that the horizon is slightly
separated from the delays. We show the following theorem.

Theorem 3. Assuming $\delta_i = o(T/\log T)$, there is a constant factor approximation to the Bayesian
MAB problem with delayed feedback. The running time for computing this policy is poly(T, 

As in Section 3.1, we will use (LP1) to bound the reward of the best collection of single-arm
policies. However, it is not clear how to solve the above LP in polynomial time, since each $\mathcal{P}^s(i, T)$
has a description which is exponential in the delay parameter $\delta_i$. We first simplify the structure of
the single-arm policies to enable poly-time computation (Step 1). Even after this, it could happen
that the policies make most of their plays after $T/2$ steps, so that the Truncation Theorem cannot
be applied directly. We then show how to compact the policies (Step 2) to reduce the horizon; this
compaction uses an interesting simulation argument. We then truncate the policies, and solve the
LP relaxation over these well-structured policies (Step 3). We finally design a scheduling algorithm
for combining these single-arm policies (Step 4) and analyze the approximation ratio.

4.1 Step 1: Block Structured Policies 

Definition 2. A single-arm policy is said to be Block Structured if the policy executes in phases
of size $(2\delta_i + 1)$. At the start of each phase (or block), the policy makes at most $\delta_i + 1$ consecutive
plays. The policy then waits for the rest of the block in order to obtain feedback on these plays, and
then moves to the next block. A block is defined to be full if exactly $\delta_i + 1$ plays are made in it.

We first show that all single-arm policies can be replaced with block structured policies while
violating the time horizon by a constant factor. The idea behind this proof is simple: we
insert delays of length $\delta_i$ after every chunk of $\delta_i + 1$ steps of the original policy.

Lemma 2. Any policy $\mathcal{P}(i,T)$ can be converted to a Block Structured policy $\mathcal{P}'(i,2T)$ such that
$R(\mathcal{P}(i,T)) \le R(\mathcal{P}'(i,2T))$ and $N(\mathcal{P}'(i,2T)) \le N(\mathcal{P}(i,T))$.

Proof. We can assume that the policy makes a play at the very first time step, because we can
eliminate any wait without any change of behavior of the policy.

Consider the actions of $\mathcal{P}(i, T)$ for the first $\delta_i + 1$ steps; the result of any play in these steps
is not known before all these plays are made. An equivalent policy $\mathcal{P}'$ simulates $\mathcal{P}$ for the first
$\delta_i + 1$ steps, and then waits for $\delta_i$ steps, for a total of $2\delta_i + 1$ steps. This ensures that $\mathcal{P}'$ knows the
outcome of the plays before the next block begins.

Now consider the steps from $\delta_i + 2$ to $2\delta_i + 2$ of $\mathcal{P}(i, T)$. As $\mathcal{P}$ is executed, it makes some plays,
possibly in an adaptive fashion based on the outcome of the plays in the previous $\delta_i + 1$ steps,
but not on the current $\delta_i + 1$ steps. $\mathcal{P}'$ however knows the outcome of the previous plays, and can
simulate $\mathcal{P}$ for these $\delta_i + 1$ steps and then wait for $\delta_i$ steps again. It is immediate to observe that
$\mathcal{P}'$ can simulate $\mathcal{P}$ at the cost of increasing the horizon by a factor of $\frac{2\delta_i + 1}{\delta_i + 1} < 2$. Observe that in
each block of $2\delta_i + 1$ steps, $\mathcal{P}'$ can make all the plays consecutively at the start of the block without any
change in behavior. The budgets are also respected in this process. This proves the lemma. □
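
The conversion in the proof can be sketched on a single decision path. The sketch below is an illustration under simplifying assumptions: the path is given as a list of booleans (`True` meaning a play at that step), adaptivity is ignored, and the function name `to_block_structured` is hypothetical. Each chunk of $\delta_i + 1$ original steps becomes one block of $2\delta_i + 1$ steps with its plays moved to the front.

```python
def to_block_structured(schedule, delay):
    """Convert one decision path into a block-structured path.

    schedule: list of booleans, True = the original policy plays at that step.
    delay:    the feedback delay delta_i.
    Every chunk of delay + 1 original steps is mapped to a block of
    2 * delay + 1 steps: the chunk's plays first, then waiting steps.
    """
    chunk, block = delay + 1, 2 * delay + 1
    out = []
    for start in range(0, len(schedule), chunk):
        plays = sum(schedule[start:start + chunk])     # at most delay + 1 plays
        out.extend([True] * plays + [False] * (block - plays))
    return out
```

Because a block has at most $\delta_i + 1$ plays at its front and length $2\delta_i + 1$, the feedback of every play in a block has arrived before the next block starts, and the number of plays on the path is unchanged.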

4.2 Step 2: Well-Structured Policies and Simulation 

The block-structured policies constructed above still suffer from the drawback that too many plays 
can be made close to the horizon T. Initially, the policy can be conservative and play very few 
times in each block waiting for feedback. In this part, we show how to compact such policies while 
preserving the reward and making sure the expected number of plays only increases by a constant 
factor. Our technique uses the idea of simulation - we make more plays initially, but hold on to 
the outcomes of the extra plays. When the original policy makes plays in a subsequent block, we 
eliminate these plays and instead use the outcomes of the saved up plays. As with the Truncation 
theorem, this argument crucially uses the martingale property that the rewards of the arm are i.i.d.
draws from the same unknown underlying distribution regardless of when the plays are made. 

Definition 3. Define a block-structured policy to be $c$-delay-free for $c < 1$ if the first time the policy
encounters a block with at least $c\delta_i$ plays, it plays every step (without waiting) beyond this point
(using feedback from $\delta_i$ plays ago) until it stops executing.

Intuitively, we are eliminating delays if a policy makes a sufficiently large number of plays in a 
block. This step not only compacts the policy, but also shrinks the state space significantly, since 
if the policy executes in a delay-free fashion, we can pretend the feedback is instantaneous. (In 
reality, the feedback is from $\delta_i$ plays ago, but we can easily couple the two executions.) We now
show that any block-structured policy can be made c-delay-free. 

Lemma 3. Given any Block Structured policy $\mathcal{P}(i, 2T)$ we can construct a $c$-delay-free Block
Structured policy $\mathcal{P}'(i, 2T)$ such that $R(\mathcal{P}(i,2T)) \le R(\mathcal{P}'(i,2T))$ and $N(\mathcal{P}'(i,2T)) \le (1 + \frac{1}{c}) N(\mathcal{P}(i,2T))$.

Proof. Consider the first time the policy encounters a block with $c\delta_i$ plays on some decision path.
We play the arm continuously beyond this point and simulate the behavior of the original policy
$\mathcal{P}$. For each contiguous play the policy $\mathcal{P}'$ makes, the outcome is available after $\delta_i$ steps. Simulate
the next play of $\mathcal{P}$ using this outcome, until $\mathcal{P}$ stops execution. Clearly, $\mathcal{P}'$ makes at most $\delta_i$
more plays than $\mathcal{P}$ on any decision path; since $\mathcal{P}$ made at least $c\delta_i$ plays, this shows that
$N(\mathcal{P}'(i,2T)) \le (1 + \frac{1}{c}) N(\mathcal{P}(i,2T))$. Since the execution of $\mathcal{P}'$ is coupled play-by-play with the
execution of $\mathcal{P}$, it is clear that $R(\mathcal{P}(i, 2T)) \le R(\mathcal{P}'(i, 2T))$. □

The policies constructed above still suffer from being too sparse if the number of plays in each
block is less than $c\delta_i$. In this step, we will compact these policies further to retain mostly the full
blocks. Note that any policy uses at most $Q_i = \frac{T}{\delta_i} = \omega(\log \delta_i)$ blocks.

Definition 4. For constant $\alpha < 1$, define a $c$-delay-free block-structured single-arm policy $\mathcal{P}$ to be
$(\alpha, c)$-well-structured if after encountering at most $q = (\alpha + o(1)) Q_i$ blocks, the policy switches to
playing continuously, i.e., executing in delay-free mode.

The next lemma further compacts the block-structured policies into well-structured policies. 
Lemma 4. For any $\alpha < 1$ and $c \le \frac{\alpha}{\alpha+2}$, given a $c$-delay-free policy $\mathcal{P}(i, 2T)$, there is an
$(\alpha, c)$-well-structured policy $\mathcal{P}'$ such that $R(\mathcal{P}(i, 2T)) \le R(\mathcal{P}'(i, 2T))$ and
$N(\mathcal{P}'(i, 2T)) \le (1 + 2/\alpha)\, N(\mathcal{P}(i, 2T))$.

Proof. Consider the execution of $\mathcal{P}$. All blocks in this policy have at most $c\delta_i$ plays, unless the
policy is executing in delay-free mode. We consider the execution of the original policy, and show
a coupled execution in the new policy $\mathcal{P}'$ so that if on a decision path, $\mathcal{P}$ used $k$ blocks, then the
number of blocks on the same decision path in $\mathcal{P}'$ is $\alpha k + O(\log \delta_i)$. We group blocks into size
classes; size class $s$ has blocks whose number of plays lies in $[2^s, 2^{s+1}]$.
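
The grouping into size classes can be illustrated with a minimal sketch. The function name `size_class` is hypothetical, and the sketch assumes the half-open convention $2^s \le x < 2^{s+1}$ so that every block falls into exactly one class; the text states the interval as closed at both ends.

```python
def size_class(plays: int) -> int:
    """Return s with 2**s <= plays < 2**(s + 1).

    Assumption: half-open intervals, so each block size belongs to one class.
    """
    if plays < 1:
        raise ValueError("a block must contain at least one play")
    # bit_length() is the number of bits needed to write the integer,
    # which equals floor(log2(plays)) + 1 for positive integers.
    return plays.bit_length() - 1


def group_by_size_class(block_sizes):
    """Group block sizes (numbers of plays) by their size class."""
    groups = {}
    for plays in block_sizes:
        groups.setdefault(size_class(plays), []).append(plays)
    return groups
```

Since every block of a policy that is not in delay-free mode has at most $c\delta_i$ plays, the number of distinct keys returned by `group_by_size_class` is logarithmic in $\delta_i$.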

We couple the executions as follows. Consider the decision tree of $\mathcal{P}$ top-down. Increase the plays
in the root block $r$ by a factor of $(1 + 2/\alpha)$; suppose there were $x$ plays originally in this block, and
the size class of this block is $s$. Consider the first $1/\alpha$ blocks on each decision path downstream of
$r$ whose size class is at most $s$; eliminate these blocks and use the outcomes of the $2x/\alpha$ extra plays
made at $r$ to simulate the behavior of $\mathcal{P}$ in these blocks as follows. (Each eliminated block has at
most $2^{s+1} \le 2x$ plays, so the extra plays suffice.) Simply store the outcomes of
the extra plays at $r$ (without updating the prior), and use these outcomes when $\mathcal{P}$ makes those
plays. Since the underlying reward distribution is fixed, these plays will be stochastically identical.
We note that $\mathcal{P}'$ does not really need to "store" the outcomes: it can simply update the prior at
$r$ and follow the execution of $\mathcal{P}$ as if it had not updated it. Again, since the reward process is a
martingale, both these executions are stochastically identical.

On each decision path, either we used $r$ to eliminate $1/\alpha$ blocks, or there are no blocks left with
size class at most $s$. Now repeat this procedure on the roots of each decision sub-tree downstream
of $r$. When this procedure terminates, consider any decision path, and mark the leftover blocks
whose increase in plays could not eliminate exactly $1/\alpha$ blocks. For any size class $s$, it is clear that
there can be at most one such block, else the earlier block could have eliminated the later one. Therefore,
if a decision path had $k$ blocks in $\mathcal{P}$, the above procedure shrinks the number of blocks to at most
$\alpha k + O(\log \delta_i)$, since there are at most $O(\log \delta_i)$ size classes. This procedure increases the expected number
of plays by at most a factor of $1 + 2/\alpha$, and we have a guarantee that each decision path uses at
most $(\alpha + o(1))Q_i$ blocks, since $\log \delta_i = o(Q_i)$. □

4.3 Step 3: Truncation and Solving the LP Relaxation 

We now truncate the horizon of the well-structured policies losing a constant factor in the reward. 

Lemma 5. Given any $(1/8, 1/17)$-well-structured single-arm policy $\mathcal{P} = \mathcal{P}(i, 2T)$, there is an
identical policy that stops execution after $T/2$ time steps (denoted by $\mathcal{P}'$), that satisfies the following:
(i) $R(\mathcal{P}') \ge R(\mathcal{P})/8$ and (ii) $N(\mathcal{P}') \le N(\mathcal{P})$.

Proof. On any decision path in $\mathcal{P}$, there can be at most $Q_i(1/8 + o(1))$ blocks before the policy
executes in delay-free mode. These blocks take up at most T/4 time steps, and the plays beyond 
that are delay-free. This implies that if the policy's horizon is truncated to T/2 time steps, and the 
policy has not stopped executing, it makes at least T/4 contiguous plays, whereas V could have 
made at most 2T plays. Therefore, on any decision path, the number of plays is reduced by at 
most a factor of 8, and by the Truncation theorem [H the lemma follows. □ 

We term well-structured policies that terminate on encountering more than T/2 time steps as 
truncated well-structured policies. The above lemmas can be summarized as: 

Theorem 4. (LP1) has a $1/\alpha$-approximation over truncated $(\alpha, \frac{\alpha}{\alpha+2})$-well-structured policies for
$\alpha \le 1/8$. This LP has the relaxed constraint $\sum_i N(\mathcal{P}(i, T/2)) \le \gamma T$, where $\gamma = 2(1 + 1/\alpha)(1 + 2/\alpha)$.
We now formulate a new LP relaxation over the truncated well-structured policies constructed in
Theorem 4. This new LP has size polynomial in $|\mathcal{S}_i|$, and we avoid the exponential dependence on $\delta_i$
due to the above transformations. Towards this end, we explicitly write out the state space of a
truncated $(\alpha, c)$-well-structured policy for a single arm. For $u \in \mathcal{S}_i$, we create a state $W(u, t)$ for
$1 \le t \le T/2$. The meaning of this state is as follows: the posterior at the beginning of a block is
$u \in \mathcal{S}_i$, and $t$ time steps have elapsed so far. This block is either regular (in which case, at most
$\alpha Q_i$ blocks have elapsed so far), or no-delay. For no-delay blocks, we can assume the block length
is 1 and feedback is instantaneously available; this unifies the presentation of the LP. (The feedback
is actually from plays made $\delta_i$ steps ago, but we can couple the two executions.)

For any state $\sigma = W(u, t)$ of arm $i$, define the following quantities. These quantities are an
easy computation from the description of $\mathcal{S}_i$, and the details are omitted.

1. Let $r_i(\sigma, \ell)$ denote the expected reward obtained when $\ell$ consecutive plays are made at state
$\sigma$, and feedback is obtained at the very end. Note that for a regular state $\sigma$, we have $0 \le \ell \le c\delta_i$.
For a no-delay state, $0 \le \ell \le 1$.

2. Let $p_i(\sigma, \sigma', \ell)$ denote the probability that if the state at the beginning of a block is $\sigma$, and $\ell$
plays are made in the block, the state at the beginning of the next block is $\sigma'$.
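
As an illustration of why these two quantities are easy to compute, the following sketch works them out for the Beta priors of Appendix B. It is a minimal sketch under stated assumptions: rewards are 0/1, the posterior is tracked as a pair of Beta parameters, and the elapsed-time component $t$ of the state is ignored. The function names `expected_block_reward` and `block_transition` are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_block_reward(a1: int, a0: int, plays: int) -> Fraction:
    """Expected number of successes in `plays` plays under a Beta(a1, a0) prior."""
    return plays * Fraction(a1, a1 + a0)


def block_transition(a1: int, a0: int, plays: int) -> dict:
    """Distribution of the posterior after a block of `plays` plays.

    Returns {(a1 + k, a0 + plays - k): probability of exactly k successes}.
    """
    out = {}
    for k in range(plays + 1):
        # Probability of one specific sequence with k successes followed by
        # plays - k failures, updating the posterior after every outcome.
        prob = Fraction(1)
        s, f = a1, a0
        for _ in range(k):
            prob *= Fraction(s, s + f)
            s += 1
        for _ in range(plays - k):
            prob *= Fraction(f, s + f)
            f += 1
        # Every ordering of the same outcomes has the same probability.
        out[(a1 + k, a0 + plays - k)] = comb(plays, k) * prob
    return out
```

Exact rational arithmetic is used so that the transition probabilities can be checked to sum to one without rounding error.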

We will formulate an LP to find a randomized truncated well-structured policy $\mathcal{P}$. Define the
following variables over the decision tree of the policy:

• $x_{i\sigma}$: the probability that the state for arm $i$ at the start of a block is $\sigma$.

• $y_{i\sigma\ell}$: the probability that the policy for arm $i$ makes $\ell$ consecutive plays starting at state $\sigma$.

We have the following LP relaxation, which simply encodes finding one randomized well-structured
policy per arm so that the expected number of plays made is at most $\gamma T$. The objective
of this LP is $\Omega(OPT)$ as a consequence of Theorem 4.

LP2 = Maximize $\sum_{i,\sigma,\ell} r_i(\sigma, \ell)\, y_{i\sigma\ell}$
The LP policy: We scale down the LP solution by a factor of $\gamma$ so that the final constraint becomes
$\sum_{i,\sigma,\ell} \ell\, y_{i\sigma\ell} \le T$; denote the new LP as (LP2s) for convenience. (LP2s) yields one randomized
well-structured policy $\mathcal{P}^r(i, T/2)$ for each arm $i$. This policy is succinctly described in Figure 1.

Policy $\mathcal{P}^r(i, T/2)$:

• If the state at the beginning of a block is $\sigma$:

1. Choose $n$ with probability $y_{i\sigma n}/x_{i\sigma}$ and make $n$ plays in the current block.

2. Wait till the end of the block; obtain feedback for the $n$ plays; and update state.

Figure 1: Single-arm policy $\mathcal{P}^r(i, T/2)$.
4.4 Step 4: Priority Based Combining of Different Policies

At this point we have a collection of randomized policies $\mathcal{P}^r(i, T/2)$ such that $\sum_i N(\mathcal{P}^r(i, T/2)) \le T$
and $\sum_i R(\mathcal{P}^r(i, T/2)) = \Omega(OPT)$. We now show how to combine the single-arm policies $\mathcal{P}^r(i, T/2)$
to achieve a globally approximate and feasible solution. This step is not complicated, but we note
that multiple policies must remain active in the combination. This aspect necessitates a novel
priority-based scheme for combining the policies.

Consider the execution of $\mathcal{P}^r(i, T/2)$. We describe the arm as active if it is either making plays
or is at the beginning of a block where it can make plays; and passive if it is waiting for feedback
on the plays within the block. Any arm which completed its waiting for the feedback turns from
passive to active mode. The final policy is shown in Figure 2.

1. Choose an arbitrary order of the arms $\{i\}$, denoted by $\pi$.

2. Each arm "participates" with probability 1/4. Initially, all participating arms are active. 

3. On each new play, among all participating arms: 

(a) Find the lowest-rank arm, say $i'$, that is active.

(b) Allocate the current play to $i'$ according to the policy $\mathcal{P}^r(i', T/2)$.

(c) (As a result of this allocation, the arm may become passive and wait for feedback.) 

Figure 2: The final policy Combine. 
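
The scheduling logic of Combine can be sketched as a small simulation. This is an illustrative sketch under simplifying assumptions, not the policy itself: each single-arm policy is replaced by a fixed, non-adaptive list of positive block sizes (the real policies choose block sizes based on feedback), and an arm is assumed to become active again `delay` steps after the last play of its block. The function name `combine` and its parameters are hypothetical.

```python
import random


def combine(arms, horizon, participate_prob=0.25, rng=None):
    """Simulate the priority scheme; return which arm receives each play.

    arms: list of (block_sizes, delay) pairs, listed in the priority order pi.
    Returns a list of length `horizon`; entry t is the arm index played at
    step t, or None if no participating arm is active at that step.
    """
    rng = rng or random.Random()
    state = {}
    for i, (blocks, _delay) in enumerate(arms):
        # Each arm participates independently with the given probability.
        if blocks and rng.random() < participate_prob:
            # [current block index, plays left in block, step when active again]
            state[i] = [0, blocks[0], 0]

    allocation = []
    for t in range(horizon):
        chosen = None
        for i in sorted(state):  # lowest rank first
            idx, _left, ready_at = state[i]
            if idx < len(arms[i][0]) and t >= ready_at:
                chosen = i
                break
        allocation.append(chosen)
        if chosen is None:
            continue
        blocks, delay = arms[chosen]
        s = state[chosen]
        s[1] -= 1
        if s[1] == 0:
            # Block finished: the arm turns passive until feedback arrives.
            s[0] += 1
            s[2] = t + 1 + delay
            if s[0] < len(blocks):
                s[1] = blocks[s[0]]
    return allocation
```

For example, with two arms that both participate, the second arm receives plays only while the first is passive or finished, which is the behavior the priority order is meant to enforce.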

The next lemma follows from an application of Markov's inequality in the same spirit as prior 
work without delayed feedback [18] . 

Lemma 6. The expected contribution of arm $i$ to the combined policy is at least $R(\mathcal{P}^r(i, T/2))/8$.

Proof. Let $G_i$ denote the arms that precede $i$ in the order $\pi$. First observe that $i$ being active does
not impact any play of $G_i$. Since each arm participates with probability 1/4, the expected number of
plays made by $G_i$ is $\sum_{i' \in G_i} N(\mathcal{P}^r(i', T/2))/4 \le T/4$. Thus, by Markov's inequality, the arms in $G_i$
make at least $T/2$ plays with probability at most 1/2. Next, arm $i$ participates with probability 1/4,
independently of the plays made by $G_i$, so with probability at least $1/4 \cdot 1/2 = 1/8$ arm $i$ participates
in a scenario where the number of plays made by the arms before $i$ is at most $T/2$. This means
every decision path of $\mathcal{P}^r(i, T/2)$ can complete in this scenario (since $i$ gets higher priority than
all arms following it). Thus with probability at least 1/8 the policy $\mathcal{P}^r(i, T/2)$ executes without any
interference, and the expected contribution from arm $i$ is at least $R(\mathcal{P}^r(i, T/2))/8$. □

The above lemma shows Theorem [3] by linearity of expectation, concluding the analysis. 
A Budgeted Allocation with Delayed Feedback

Recall that the budgeted allocation problem is a collection of MAB instances, one for each impression
type. For each impression of type $j$, the MAB instance $M(T_j)$ is over $T_j$ arrivals (or plays)
of this impression. (Note that $\sum_j T_j = T$.) The arms for this impression are denoted $(i, j)$, where
$i$ denotes the bidders. If a play along $(i, j)$ is successful (which is observed after $\delta_{ij}$ arrivals of
impressions of type $j$), bidder $i$ accrues reward $b_{ij}$; the probability of success is drawn from an
independent prior $\mathcal{D}_{ij}$. The arms are connected by the budget constraint stating that the total
reward accrued by bidder $i$ across all MAB instances is at most $B_i$.

Algorithm and Analysis: The overall algorithm is similar to that for a single MAB instance. 

1. The initial LP relaxation (that we don't solve) encodes constructing one randomized policy
per arm $(i, j)$ such that: (i) the total expected number of plays for MAB $M_j$ is at most $T_j$;
and (ii) the total expected reward accrued for bidder $i$ is at most $B_i$. Note that the latter
constraint was not present in (LP1) in the previous section.

2. With a loss of factor 2 in the reward, we define "shadow budgets" $B_{ij} \le B_i$ such that $B_{ij}/b_{ij}$
are integers, as discussed in Section EU 

3. For each $(i, j)$, we transform the single-arm policy (with the budgets $B_{ij}$) just as in the
previous section into a well-structured policy using $T_j/2$ impressions.

4. We solve an LP over truncated well-structured policies, with the additional constraint that 
the total reward accrued for bidder i is at most Bi. Let the LP objective value be M. 

5. To construct the global policy, we run the policy Combine independently for each $j$. Let
the randomized well-structured policy for arm $(i, j)$ that is constructed by the LP be denoted
$\mathcal{P}^r(i, j, T_j/2)$, and let $Z_{ij}$ denote the reward achieved. Note $Z_{ij} \le B_{ij} \le B_i$.

The analysis in Section [3] shows that the policy Combine yields expected reward $\sum_{i,j} \mathbf{E}[Z_{ij}] =
\Omega(M)$. However, we can only extract a reward of $\min\{B_i, \sum_j Z_{ij}\}$ from advertiser $i$. We show:

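Lemma 7. For a collection of random variables $\{Z_{ij}\}$, if $\sum_j \mathbf{E}[Z_{ij}] \le B_i$ and $Z_{ij} \le B_i$, then

$$\mathbf{E}\Big[\min\Big\{B_i, \sum_j Z_{ij}\Big\}\Big] \ge \frac{1}{2} \sum_j \mathbf{E}[Z_{ij}].$$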

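Proof. Consider a process where advertiser $i$ settles the bill for $j$ before proceeding to the bill for
$j + 1$. If the payout on $j$ is $Y_j$, then $Y_j = \min\{Z_{ij}, \max\{B_i - \sum_{j'<j} Z_{ij'}, 0\}\}$, and hence

$$Y_j = B_i \min\left\{\frac{Z_{ij}}{B_i}, \max\left\{1 - \frac{\sum_{j'<j} Z_{ij'}}{B_i}, 0\right\}\right\} \ge Z_{ij} \max\left\{1 - \frac{\sum_{j'<j} Z_{ij'}}{B_i}, 0\right\}.$$

The last part follows from $\min\{a, b\} \ge ab$ for any $0 \le a, b \le 1$. Clearly $\sum_j Y_j \le B_i$ in every
scenario. And as a consequence of $Z_{ij}$ being independent of $Z_{ij'}$ for $j' < j$:

$$\mathbf{E}[Y_j] \ge \mathbf{E}\left[\max\left\{1 - \frac{\sum_{j'<j} Z_{ij'}}{B_i}, 0\right\}\right] \mathbf{E}[Z_{ij}] \ge \left(1 - \frac{\sum_{j'<j} \mathbf{E}[Z_{ij'}]}{B_i}\right) \mathbf{E}[Z_{ij}].$$

We can estimate $\sum_j \mathbf{E}[Y_j]$ to be at least $\frac{1}{B_i}$ times the area under the staircase defined by

$$(0, \mathbf{E}[Z_{i1}]), \ldots, \Big(\sum_{j'<j} \mathbf{E}[Z_{ij'}], \sum_{j' \le j} \mathbf{E}[Z_{ij'}]\Big), \ldots, \Big(B_i, \sum_j \mathbf{E}[Z_{ij}]\Big).$$

This area includes the triangle defined by $(0, 0)$, $(B_i, 0)$ and $(B_i, \sum_j \mathbf{E}[Z_{ij}])$, and thus the reward is
at least $\frac{1}{B_i} \cdot \frac{B_i}{2} \sum_j \mathbf{E}[Z_{ij}] = \frac{1}{2} \sum_j \mathbf{E}[Z_{ij}]$. This
proves the lemma. □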
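
The inequality of Lemma 7 can be checked numerically on small examples. The sketch below is only a sanity check on hypothetical inputs, not part of the proof: it assumes independent, finitely supported rewards given as lists of (value, probability) pairs, and enumerates all outcomes exactly. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def capped_expectation(dists, budget):
    """Exact E[min(budget, sum_j Z_j)] for independent discrete Z_j.

    dists: one list of (value, probability) pairs per random variable.
    """
    total = Fraction(0)
    for outcome in product(*dists):
        prob = Fraction(1)
        value = 0
        for z, p in outcome:
            prob *= p
            value += z
        total += prob * min(budget, value)
    return total


def sum_of_means(dists):
    """Exact sum_j E[Z_j]."""
    return sum(z * p for dist in dists for z, p in dist)


# Three rewards, each equal to the full budget 10 with probability 1/3.
example = [[(10, Fraction(1, 3)), (0, Fraction(2, 3))] for _ in range(3)]
lhs = capped_expectation(example, 10)
rhs = sum_of_means(example) / 2
```

In this example the sum of the means equals the budget, so the hypotheses of the lemma hold, and the capped expectation is $10(1 - (2/3)^3) = 190/27$, which exceeds half the sum of the means.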
As a consequence of Lemma 7 and the fact that $M = \Omega(OPT)$, we have:

Theorem 5. If $T_j \ge 48(8\delta_{ij} + 2) \log T_j$ for all $i, j$, then in time polynomial in $n$, $m$, and $\max_{i,j} |\mathcal{S}_{ij}|$,
we can compute a policy whose expected reward is an $O(1)$ approximation to the expected reward of
the optimal policy for the Bayesian budgeted allocation problem with delayed feedback.

B Example of Priors: Beta Distributions 

Consider a fixed but unknown distribution over two outcomes (success or failure of treatment, click
or no click, conversion or no conversion). This is a Bernoulli$(1, \theta)$ trial where $\theta$ is unknown. One
way of encoding the uncertainty about $\theta$ is to use a distribution which encodes our "prior" belief
regarding $\theta$. On getting a new sample, we update our belief to a "posterior distribution"
using Bayes' rule. Note that we need a family of distributions and associated update rules to specify
the prior. For two-valued outcomes, one such family is the Beta distribution, referred to as the
conjugate prior of the Bernoulli distribution. A Beta distribution with parameters $\alpha_0, \alpha_1 \in \mathbb{Z}^+$,
which we denote $Beta(\alpha_1, \alpha_0)$, has p.d.f. of the form $c\,\theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$, where $c$ is a normalizing
constant. $Beta(1, 1)$ is the uniform distribution. The distribution $Beta(\alpha_1, \alpha_0)$ corresponds to the
current (posterior) distribution over the possible values of $\theta$ after having observed $(\alpha_1 - 1)$ 1's
and $(\alpha_0 - 1)$ 0's, starting from the belief that $\theta$ was uniform, distributed as $Beta(1, 1)$. Given the
distribution $Beta(\alpha_1, \alpha_0)$ as our prior, the expected value of $\theta$ is $\frac{\alpha_1}{\alpha_1 + \alpha_0}$. In this example, updating
the prior on a sample is straightforward. On seeing a 1, the posterior (of the current sample,
and the prior for the next sample) is $Beta(\alpha_1 + 1, \alpha_0)$. On seeing a 0, the new distribution is
$Beta(\alpha_1, \alpha_0 + 1)$. As stated earlier, the hard case for our algorithms (and the case that is typical
in practice) is when the input to the problem is a set of arms $\{i\}$ with priors $\mathcal{D}_i \sim Beta(\alpha_{1i}, \alpha_{0i})$
where $\alpha_{0i} \gg \alpha_{1i}$, which corresponds to a set of poor prior expectations of the arms.
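
The update rule described above amounts to incrementing one of two counters. A minimal sketch follows; the function names `beta_mean` and `beta_update` are hypothetical, and a prior is represented simply as the pair $(\alpha_1, \alpha_0)$.

```python
from fractions import Fraction


def beta_mean(a1: int, a0: int) -> Fraction:
    """Expected value of theta under Beta(a1, a0)."""
    return Fraction(a1, a1 + a0)


def beta_update(a1: int, a0: int, outcome: int):
    """Posterior parameters after observing one outcome (1 or 0)."""
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    return (a1 + 1, a0) if outcome == 1 else (a1, a0 + 1)


# Start from the uniform prior Beta(1, 1) and observe 1, 0, 0.
posterior = (1, 1)
for obs in (1, 0, 0):
    posterior = beta_update(*posterior, obs)
```

After these three observations the posterior is $Beta(2, 3)$, whose mean is $2/5$.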

Representing Priors as a DAG. Given a family of priors, we model them as a directed acyclic
graph (DAG) where the vertices correspond to the priors and posteriors. The root $\rho$ encodes the
initial prior $\mathcal{D}$. The children of $\rho$ correspond to the possible observations of a sample drawn from $\mathcal{D}$.
In the case of Beta priors and two-valued outcomes, a node $u$ corresponding to $Beta(\alpha_1, \alpha_0)$ would
have two children corresponding to $Beta(\alpha_1 + 1, \alpha_0)$ and $Beta(\alpha_1, \alpha_0 + 1)$, and the edges will be labeled
1 and 0 respectively. The probability of taking the edge labeled 1 from $u$ to the node $v$ corresponding
to $Beta(\alpha_1 + 1, \alpha_0)$ would be the probability of observing a 1 given the prior $Beta(\alpha_1, \alpha_0)$, which
is $\frac{\alpha_1}{\alpha_1 + \alpha_0}$. We will define $p_{uv} = \frac{\alpha_1}{\alpha_1 + \alpha_0}$. At each node $u$ the probability of observing 1 corresponds
to the "reward" $\mu_u$. Thus the DAG represents the evolution of priors/posteriors completely, and
further (for two-valued outcomes) each node is represented by a pair of numbers. Therefore for a
horizon of size $T$, the entire information is captured by a DAG of size $O(T^2)$.
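
The $O(T^2)$ size bound can be seen by building the DAG explicitly. The following is a minimal sketch, assuming Beta priors with integer parameters and nodes keyed by the pair $(\alpha_1, \alpha_0)$; the function name `build_beta_dag` is hypothetical.

```python
from fractions import Fraction


def build_beta_dag(a1: int, a0: int, horizon: int):
    """Build the posterior DAG reachable from Beta(a1, a0) in `horizon` samples.

    Returns (nodes, edges) where edges[u][v] is the transition probability
    from node u to child v. Nodes are pairs (alpha_1, alpha_0).
    """
    edges = {}
    frontier = {(a1, a0)}
    for _ in range(horizon):
        next_frontier = set()
        for (s, f) in frontier:
            p_one = Fraction(s, s + f)  # probability of observing a 1
            edges[(s, f)] = {(s + 1, f): p_one, (s, f + 1): 1 - p_one}
            next_frontier.update(edges[(s, f)])
        frontier = next_frontier
    nodes = set(edges) | frontier
    return nodes, edges


nodes, edges = build_beta_dag(1, 1, 3)
```

Different observation sequences with the same counts lead to the same node, so depth $d$ contains only $d + 1$ nodes and the total is $(T + 1)(T + 2)/2$.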

Martingale Property: A critical observation, true of all conjugate prior distributions and the
resulting DAGs, is that we have a Martingale property: $\mu_u = \sum_v p_{uv}\, \mu_v$ (summing over the
children $v$ of $u$). Therefore this gives us a way of switching between two different views of the same
evolution of the posterior space: the first view is where the pi (equivalently Di ) is chosen according 
to $\mathcal{D}_i$ and we observe a collection of trajectories and take expectation over them, and the second
view is where we have a Martingale process of the posterior spaces in the DAG. 
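
For Beta priors the Martingale property can be verified directly: the mean at a node equals the probability-weighted average of the means at its two children. The sketch below checks this exactly with rational arithmetic; the function names are hypothetical and the Beta family is an assumption made for illustration.

```python
from fractions import Fraction


def mean(s: int, f: int) -> Fraction:
    """Mean reward at the node Beta(s, f)."""
    return Fraction(s, s + f)


def martingale_gap(s: int, f: int) -> Fraction:
    """Weighted average of the children's means minus the node's own mean."""
    p_one = mean(s, f)  # probability of moving to Beta(s + 1, f)
    children_avg = p_one * mean(s + 1, f) + (1 - p_one) * mean(s, f + 1)
    return children_avg - mean(s, f)


gaps = [martingale_gap(s, f) for s in range(1, 8) for f in range(1, 8)]
```

Every entry of `gaps` is exactly zero, since $\frac{s}{s+f} \cdot \frac{s+1}{s+f+1} + \frac{f}{s+f} \cdot \frac{s}{s+f+1} = \frac{s}{s+f}$.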